1 Introduction

The geometric relation between 3D objects and their views is a key component of various applications in computer vision, image coding, and animation. For example, the change in the 2D projection of a moving 3D object is a source of information both for 3D reconstruction and for visual recognition. In the former case the retinal changes produce the cues for 3D recovery; in the latter case they provide the cues for factoring out the effects of changing viewing positions on the recognition process.

The introduction of affine and projective tools into the field of computer vision has brought increased activity in the fields of structure from motion and recognition in recent years. The emerging realization is that non-metric information, although weaker than the information provided by depth maps and rigid camera geometries, is nonetheless useful: the framework may provide simpler algorithms, camera calibration is not required, more freedom in picture-taking is allowed (such as taking pictures of pictures of objects), and there is no need to distinguish between orthographic and perspective projections. Contributions to this framework include (the list is not intended to be complete) [17, 2, 30, 12, 46, 47, 13, 26, 7, 32, 34, 36, 25, 45, 29, 8, 10, 23, 31, 16, 15, 48]; most relevant to this paper is the work described in [17, 7, 13, 34, 36].

The material introduced so far in the literature concerning 3D geometry from multiple views focuses on either the projective framework [7, 13, 36] or the affine framework. The latter requires assuming parallel projection (cf. [17, 46, 45, 30]), or certain a priori assumptions on object structure (for determining the location of the plane at infinity [7, 28]), or purely translational camera motion [24] (see also later in the text).

In this paper, we propose a unified framework that includes, by generalization and specialization, the Euclidean, projective and affine frameworks. The framework, which we call “relative affine”, gives rise to an equation that captures most of the spectrum of previous results related to 3D-from-2D geometry, and introduces new, extremely simple algorithms for the tasks of reconstruction from multiple views, recognition by alignment, and certain image coding applications. For example, previous results in these areas (affine structure from orthographic views, projective structure from perspective views, the use of the plane at infinity for reconstruction, i.e., obtaining affine structure from perspective views, epipolar-geometry related results, and reconstruction under restricted camera motion such as pure translation) are often reduced to a single-line proof under the new framework (see Corollaries 1 to 6).

The basic idea is to choose a representation of projective space in which an arbitrarily chosen reference plane becomes the plane at infinity. We then show that under general, uncalibrated camera motion, the resulting new representations can be described by an element of the affine group applied to the initial representation. As a result, we obtain an affine invariant, which we call relative affine structure, relative to the initial representation. Via several corollaries of this basic result we show, among other things, that the invariant is a generalization of the affine structure under parallel projection [17] and is a specialization of the projective structure (projective structure can be described as a ratio of two relative affine structures). Furthermore, in computational terms the relative affine result requires fewer corresponding points and fewer calculations than the projective framework, and is the only framework next in generality after the projective one when working with perspective views. Parts of this work, as it evolved, have been presented in the meetings found in [33, 38], and in [27].

2 Notation 

We consider object space to be the three-dimensional projective space $\mathcal{P}^3$, and image space to be the two-dimensional projective space $\mathcal{P}^2$. An object (or scene) is modeled by a set of points, and $\psi_i \subset \mathcal{P}^2$ denote views (arbitrary), indexed by $i$, of the object. Given two views with projection centers $O, O' \in \mathcal{P}^3$, respectively, the epipoles are defined as the intersections of the line $OO'$ with the two image planes. A set of numbers defined up to scale is enclosed in brackets; a set of numbers enclosed in parentheses defines a vector in the usual way. Because the image plane is finite, we can assign, without loss of generality, the value 1 as the third homogeneous coordinate of every observed image point. That is, if $(x, y)$ are the observed image coordinates of some point (with respect to some arbitrary origin, say the geometric center of the image), then $p = [x, y, 1]$ denotes the homogeneous coordinates of that point in the image plane. When only two views $\psi_0, \psi_1$ are discussed, points in $\psi_0$ are denoted by $p$, their corresponding points in $\psi_1$ are denoted by $p'$, and the epipoles are $v \in \psi_0$ and $v' \in \psi_1$. When multiple views are considered, appropriate indices are added as explained later in the text. The symbol $\cong$ denotes equality up to scale, $GL_n$ stands for the group of $n \times n$ non-singular matrices, and $PGL_n$ is the same group defined up to scale.

A camera coordinate system is a Euclidean frame describing the actual internal geometry of the camera (position of the image plane relative to the camera center). If $p = (x, y, 1)^T$ is a point in the observed coordinate representation, then $M^{-1} p$ represents the camera coordinates, where $M$ is an upper-triangular matrix containing the internal parameters of the camera. When $M$ is known, the camera is said to be internally calibrated, and when $M = I$ the camera is in “standard” calibration mode. The material presented in this paper does not require further details of internal calibration, such as its decomposition into the components of principal point, image plane aspect ratio and skew; only the mere existence of $M$ is required for the remainder of this paper.

3 Relative Affine Structure 

The following theorems and corollaries introduce our 
main results which are then followed by explanatory text. 

Theorem 1 (Relative Affine Structure [33]) Let $\pi$ be some arbitrary plane and let $P_j \in \pi$, $j = 1, 2, 3$, project onto $p_j, p'_j$ in views $\psi_0, \psi_1$, respectively. Let $p_0 \in \psi_0$ and $p'_0 \in \psi_1$ be projections of $P_0 \notin \pi$. Let $A \in PGL_3$ be a homography of $\mathcal{P}^2$ determined by the equations $A p_j \cong p'_j$, $j = 1, 2, 3$, and $A v \cong v'$, scaled to satisfy the equation $p'_0 \cong A p_0 + v'$. Then, for any point $P \in \mathcal{P}^3$ projecting onto $p \in \psi_0$ and $p' \in \psi_1$, we have

$$p' \cong A p + k v' \qquad (1)$$

The coefficient $k = k(p)$ is independent of $\psi_1$, i.e., is invariant to the choice of the second view, and the coordinates of $P$ are $[x, y, 1, k]$.

Figure 1: See proof of Theorem 1.

Proof. We assign the coordinates $(1, 0, 0, 0)$, $(0, 1, 0, 0)$, $(0, 0, 1, 0)$ to $P_1, P_2, P_3$, respectively. Let $O$ and $O'$ be the projection centers associated with the views $\psi_0$ and $\psi_1$, respectively, and let their coordinates be $(0, 0, 0, 1)$, $(1, 1, 1, 1)$, respectively (see Figure 1). This choice of representation is always possible because the two cameras are part of $\mathcal{P}^3$. By construction, the point of intersection of the line $OO'$ with $\pi$ has the coordinates $(1, 1, 1, 0)$.

Let $P$ be some object point projecting onto $p, p'$. The line $OP$ intersects $\pi$ at the point $(\alpha, \beta, \gamma, 0)$. The coordinates $\alpha, \beta, \gamma$ can be recovered by projecting the image plane onto $\pi$, as follows. Given the epipoles $v \in \psi_0$ and $v' \in \psi_1$, we have by our choice of coordinates that $p_1, p_2, p_3$ and $v$ are projectively (in $\mathcal{P}^2$) mapped onto $e_1 = (1, 0, 0)$, $e_2 = (0, 1, 0)$, $e_3 = (0, 0, 1)$ and $e_4 = (1, 1, 1)$, respectively. Therefore, there exists a unique element $A_1 \in PGL_3$ that satisfies $A_1 p_j \cong e_j$, $j = 1, 2, 3$, and $A_1 v = e_4$. Note that we have made a choice of scale by setting $A_1 v$ to $e_4$; this is simply for convenience, as will be clear later on. Let $A_1 p = (\alpha, \beta, \gamma)$.

Similarly, the line $O'P$ intersects $\pi$ at $(\alpha', \beta', \gamma', 0)$. Let $A_2 \in PGL_3$ be defined by $A_2 p'_j \cong e_j$, $j = 1, 2, 3$, and $A_2 v' = e_4$. Let $A_2 p' = (\alpha', \beta', \gamma')$. Since $P$ can be described as a linear combination of two points along each of the lines $OP$ and $O'P$, we have the following equation:

$$P \cong (\alpha, \beta, \gamma, 0)^T - k (0, 0, 0, 1)^T = \mu (\alpha', \beta', \gamma', 0)^T - s (1, 1, 1, 1)^T,$$

from which it readily follows that $k = s$ (i.e., the transformation between the two representations of $\mathcal{P}^3$ is affine). Note that since only ratios of coordinates are significant in $\mathcal{P}^n$, $k$ is determined up to a uniform scale, and any point $P_0 \notin \pi$ can be used to set a mutual scale for all views, by setting an appropriate scale for $A$, for example. The value of $k$ can easily be determined from image measurements as follows: we have

$$\mu (\alpha', \beta', \gamma')^T = (\alpha, \beta, \gamma)^T + k (1, 1, 1)^T.$$

Multiply both sides by $A_2^{-1}$ to obtain $\mu p' = A p + k v'$, where $A = A_2^{-1} A_1$. Note that $A \in PGL_3$ is a homography between the two image planes, due to $\pi$, determined by $p'_j \cong A p_j$, $j = 1, 2, 3$, and $A v \cong v'$ (therefore, it can be recovered directly without going through $A_1, A_2$). Similar proofs that a homography of a plane can be recovered from three points and the epipoles are found in [34, 29]. Since $k$ is determined up to a uniform scale, we need a fourth correspondence $p_0, p'_0$, and let $A$, or $v'$, be scaled such that $p'_0 \cong A p_0 + v'$. Finally, $[x, y, 1, k]$ is the homogeneous coordinate representation of $P$, and the $3 \times 4$ matrix $[A, v']$ is a camera transformation matrix between the two views. □

Theorem 2 (Further Algebraic Aspects [27]) Let the coordinate transform from $P = z p$ to $P' = z' p'$ be described by $P' = M' R M^{-1} P + M' T$, where $R, T$ are the rotational and translational parameters of the relative camera displacement, and $M, M'$ are the internal camera parameters. Given $A, \pi, k$ defined in Theorem 1, let $n$ be the unit normal to the plane $\pi$, and $d_\pi$ the (perpendicular) distance of the origin to $\pi$, both in the first camera coordinate frame. Then,

$$A = M' \left( R + \frac{T n^T}{d_\pi} \right) M^{-1} \qquad (2)$$

$$k = \frac{z_0}{z} \alpha,$$

where $z_0$ is the depth of $P_0 \notin \pi$, and $\alpha = \alpha(p)$ is the affine structure of $P$ in the case of parallel projection (the ratio of perpendicular distances of $P$ and $P_0$ from $\pi$).

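Proof. Let $\tilde{P} \in \pi$ be the point at the intersection of the ray $OP$ with $\pi$. Then $\tilde{P}' = M' R M^{-1} \tilde{P} + M' T$. Since $n^T M^{-1} \tilde{P} = d_\pi$, we have $\tilde{P}' = M' (R + \frac{T n^T}{d_\pi}) M^{-1} \tilde{P}$. Since the term in parentheses describes the homography due to $\pi$, we have $A = M' (R + \frac{T n^T}{d_\pi}) M^{-1}$, which is the generalization of the classical motion of planes in the calibrated case [9, 43]. For the point $P$ we have:

$$\frac{z'}{z} p' = M' R M^{-1} p + \frac{1}{z} M' T.$$

Let $d_p = d_\pi - n^T (M^{-1} P)$ be the (perpendicular) distance from $P$ to $\pi$. Substituting $M' R M^{-1} = A - M' T n^T M^{-1} / d_\pi$ and $P = z p$, we thus have

$$\frac{z'}{z} p' = A p + \frac{d_p}{z d_\pi} M' T.$$

Hence $v' \cong M' T$ and $k$ is proportional to $d_p / z$. Normalizing so that $k = 1$ at $P_0$ (the scaling of Theorem 1) gives $k = \frac{z_0}{z} \frac{d_p}{d_0}$, where $d_0$ is the (perpendicular) distance of $P_0$ from $\pi$ (see Figure 3-a). Finally, note that the ratio $\alpha = d_p / d_0$ of the distances of $P$ and $P_0$ from $\pi$ is the affine structure when the projection is parallel (see Figure 2). □

Figure 2: Affine structure under parallel projection is $d_p / d_0$. This can be seen from the similarity of trapezoids followed by the similarity of triangles, which yields the ratio $d_p / d_0$.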
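As an illustrative numerical check of Theorems 1 and 2 (a sketch, not part of the original experiments), the following Python/NumPy code builds a synthetic two-view scene with $M = M' = I$, recovers $A$ from three points on the plane plus the epipoles, fixes its scale with $P_0$, and compares the recovered $k$ with $\frac{z_0}{z} \frac{d_p}{d_0}$. The scene values and the helper names `homography_dlt`, `fix_scale`, `relative_affine` and `demo` are assumptions made for this example.

```python
import numpy as np

def homography_dlt(src, dst):
    """Homography H (up to scale) with dst_i ~ H @ src_i, from >= 4 homogeneous matches."""
    rows = []
    for x, (u, v, w) in zip(src, dst):
        z = np.zeros(3)
        # The three rows of dst_i x (H src_i) = 0, with H flattened row by row.
        rows.append(np.concatenate([z, -w * x, v * x]))
        rows.append(np.concatenate([w * x, z, -u * x]))
        rows.append(np.concatenate([-v * x, u * x, z]))
    _, _, vt = np.linalg.svd(np.array(rows))
    return vt[-1].reshape(3, 3)          # null vector of the stacked system

def fix_scale(A, v2, p0, p0_2):
    """Rescale A so that p0' ~ A p0 + v' (the normalization of Theorem 1)."""
    c = np.cross(p0_2, A @ p0)
    return (np.cross(v2, p0_2) @ c / (c @ c)) * A

def relative_affine(A, v2, p, p2):
    """Least-squares k in p' ~ A p + k v'."""
    c = np.cross(p2, v2)
    return np.cross(A @ p, p2) @ c / (c @ c)

def demo():
    th = 0.1                                       # assumed rotation about the y axis
    R = np.array([[np.cos(th), 0, np.sin(th)], [0, 1, 0], [-np.sin(th), 0, np.cos(th)]])
    T = np.array([0.5, 0.1, 0.2])
    n, d = np.array([0.0, 0.0, 1.0]), 10.0         # reference plane: n . X = d
    view0 = lambda X: X / X[2]
    view1 = lambda X: (R @ X + T) / (R @ X + T)[2]
    plane_pts = np.array([[1.0, 0.0, 10.0], [-1.0, 2.0, 10.0], [0.0, -1.0, 10.0]])
    v, v2 = -R.T @ T, T                            # epipoles in views 0 and 1
    src = [view0(X) for X in plane_pts] + [v]
    dst = [view1(X) for X in plane_pts] + [v2]
    P0, P = np.array([0.5, 0.5, 6.0]), np.array([-0.3, 0.8, 7.0])
    A = fix_scale(homography_dlt(src, dst), v2, view0(P0), view1(P0))
    k = relative_affine(A, v2, view0(P), view1(P))
    expected = (P0[2] / P[2]) * ((d - n @ P) / (d - n @ P0))   # (z0/z)(d_p/d_0)
    return k, expected
```

Calling `demo()` returns the recovered $k$ together with the value predicted by Theorem 2; the two should agree up to floating-point error for this noise-free scene.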

Corollary 1 Relative affine structure $k$ approaches affine structure under parallel projection when $O$ goes to infinity, i.e., $k \to \alpha$ when $O \to \infty$.

Proof. When $O \to \infty$, then $z, z_0 \to \infty$ and $z_0 / z \to 1$. Thus $k = \frac{z_0}{z} \frac{d_p}{d_0} \to \frac{d_p}{d_0}$ (see Figure 2). □

Corollary 2 When the plane $\pi$ is at infinity (with respect to the camera coordinate frame), then relative affine structure $k$ is affine structure under perspective, $k = z_0 / z$, $A = M' R M^{-1}$, and if, in addition, the cameras are internally calibrated as $M = M' = I$, then $A = R$.

Proof. When $\pi$ is at infinity, then $d_p, d_0 \to \infty$. Thus $k \to \frac{z_0}{z}$. Also, $d_\pi \to \infty$, thus $A \to M' R M^{-1}$ (see Figure 3-b). □

Corollary 3 (Pure Translation) In the case of pure translational motion of the camera, and when the internal camera parameters remain fixed, i.e., $M = M'$, then the selection of the identity homography $A = I$ (in Equation 1) leads to an affine reconstruction of the scene (i.e., the identity matrix is the homography due to the plane at infinity). In other words, the scalar $k$ in

$$p' \cong p + k v'$$

is invariant under all subsequent camera motions that leave the internal parameters unchanged and consist of only translation of the camera center. The coordinates $[x, y, 1, k]$ are related to the camera coordinate frame by an element of the affine group.

Proof. Follows immediately from Corollary 2: the homography due to the plane at infinity is $A = M' R M^{-1}$. Hence, $A = I$ when $M = M'$ and $R = I$ (pure translational motion). □

Corollary 4 The projective structure of the scene can be described as the ratio of two relative affine structures, each with respect to a distinct reference plane $\pi, \hat{\pi}$, respectively, which in turn can be described as the ratio of affine structures under parallel projection with respect to the same two planes.

Proof. Let $k_\pi$ and $k_{\hat{\pi}}$ be the relative affine structures with respect to planes $\pi$ and $\hat{\pi}$, respectively. From Theorem 2 we have that $k_\pi = \frac{z_0}{z} \frac{d_p}{d_0}$ and $k_{\hat{\pi}} = \frac{z_0}{z} \frac{\hat{d}_p}{\hat{d}_0}$. The ratio $k_\pi / k_{\hat{\pi}}$ removes the dependence on the projection center $O$ ($z_0 / z$ cancels out) and is therefore a projective invariant (see Figure 4). This projective invariant is also the ratio of cross-ratios of the rays $OP$ and $OP_0$ with their intersections with the two planes $\pi$ and $\hat{\pi}$, which was introduced in [34, 36] as “projective depth”. It is also the ratio of two affine structures under parallel projection (recall that $d_p / d_0$ is the affine structure; see Figure 2). □

Corollary 5 The “essential” matrix $E = [v'] R$ is a particular case of a generalized matrix $F = [v'] A$, where $[v']$ denotes the skew-symmetric matrix satisfying $[v'] u = v' \times u$. The matrix $F$, referred to as the “fundamental” matrix in [7], is unique and does not depend on the plane $\pi$. Furthermore, $F v = 0$ and $F^T v' = 0$.

Proof. Let $p \in \psi_0$, $p' \in \psi_1$ be two corresponding points, and let $l, l'$ be their corresponding epipolar lines, i.e., $l \cong p \times v$ and $l' \cong p' \times v'$. Since lines are projective invariants, any point along $l$ is mapped by $A$ to some point along $l'$. Thus, $l' \cong v' \times A p$, and because $p'$ is incident to $l'$, we have $p'^T (v' \times A p) = 0$, or equivalently $p'^T [v'] A p = 0$, or $p'^T F p = 0$, where $F = [v'] A$. From Corollary 2, $A = R$ in the special case where the plane $\pi$ is at infinity and the cameras are internally calibrated as $M = M' = I$, thus $E = [v'] R$ is a special case of $F$. The uniqueness of $F$ follows from substitution of $A$ with Equation 2 and noting that $[v'] M' T = 0$ (because $v' \cong M' T$), thus $F = [v'] M' R M^{-1}$. Finally, since $A v \cong v'$, $[v'] A v \cong [v'] v' = 0$, thus $F v = 0$, and $A^T [v']^T v' = -A^T [v'] v' = 0$, thus $F^T v' = 0$. □
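The plane-independence of $F$ claimed in Corollary 5 can be checked numerically. The following Python/NumPy sketch (an illustration under assumed camera values, not the authors' code) builds the homographies of two different planes from Equation 2 and confirms that both give the same $F = [v'] A$, with $F v = 0$ and $F^T v' = 0$. The function names `skew`, `plane_homography` and `demo`, and all numeric values, are hypothetical.

```python
import numpy as np

def skew(a):
    """Matrix [a] such that skew(a) @ b equals np.cross(a, b)."""
    return np.array([[0.0, -a[2], a[1]],
                     [a[2], 0.0, -a[0]],
                     [-a[1], a[0], 0.0]])

def plane_homography(M, M2, R, T, n, d):
    """Equation 2: A = M' (R + T n^T / d_pi) M^{-1}."""
    return M2 @ (R + np.outer(T, n) / d) @ np.linalg.inv(M)

def demo():
    th = 0.2                                        # assumed rotation about the z axis
    R = np.array([[np.cos(th), -np.sin(th), 0], [np.sin(th), np.cos(th), 0], [0, 0, 1]])
    T = np.array([0.3, -0.1, 0.4])
    M = np.array([[2.0, 0.1, 0.3], [0.0, 1.8, 0.2], [0.0, 0.0, 1.0]])
    M2 = np.array([[2.2, 0.0, 0.1], [0.0, 2.0, -0.1], [0.0, 0.0, 1.0]])
    v2 = M2 @ T                                     # epipole in the second view
    v = -M @ R.T @ T                                # epipole in the first view
    # Two distinct reference planes (unit normal, distance from the origin).
    A1 = plane_homography(M, M2, R, T, np.array([0.0, 0.0, 1.0]), 5.0)
    A2 = plane_homography(M, M2, R, T, np.array([0.6, 0.0, 0.8]), 8.0)
    return skew(v2) @ A1, skew(v2) @ A2, v, v2
```

The two returned matrices coincide because the plane-dependent term of $A$ is a multiple of $M' T n^T M^{-1}$, which $[v']$ annihilates.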

Corollary 6 (stream of views) Given $m > 2$ views, let $A_j$ and $v'_j$ be the homographies of $\pi$ and the epipoles, respectively, from view $\psi_0$ to view $\psi_j$, and let the views of an object point $P$ be $p_j$, where the index $j$ ranges over the $m$ views (so that $p_0$ here is the view of $P$ in $\psi_0$). Then, the least squares solution for $k$ is given by

$$k = \frac{\sum_j (p_j \times v'_j)^T (A_j p_0 \times p_j)}{\sum_j \| p_j \times v'_j \|^2} \qquad (3)$$

Proof. This is simply a calculation based on the observation that given a general equation of the type $a \cong b + k c$, then by performing a cross product with $a$ on both sides we get $k (a \times c) = b \times a$. The value of $k$ can be found using the normal equations (treating $k$ as a vector of dimension 1):

$$k = \frac{(b \times a)^T (a \times c)}{\| a \times c \|^2}.$$

Similarly, if in addition we have $a' \cong b' + k c'$, then the overall least squares solution is given by

$$k = \frac{(b \times a)^T (a \times c) + (b' \times a')^T (a' \times c')}{\| a \times c \|^2 + \| a' \times c' \|^2}.$$ □

Figure 3: (a) Relative affine structure: $k = \frac{z_0}{z} \frac{d_p}{d_0}$. (b) Affine structure under perspective (when $\pi$ is at infinity). Note that the rays $OP$ and $O'P$ are parallel, thus the homography is the rotational component of motion.

Figure 4: Projective depth [34, 36] is the ratio of two relative affine structures, each with respect to a distinct reference plane, which is also the ratio of two affine structures (see Corollary 4 for more details).
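The least-squares formula of Corollary 6 translates almost directly into code. The sketch below is a minimal Python/NumPy illustration; the function name `relative_affine_ls`, the data layout (a list of `(A_j, v_j, p_j)` triples), and the numeric values are assumptions made for the example, and the views are generated without noise.

```python
import numpy as np

def relative_affine_ls(p, views):
    """Least-squares k from p_j ~ A_j p + k v'_j.

    p     : homogeneous image point in the reference view
    views : list of (A_j, v_j, p_j) with homography, epipole and matching point
    """
    num = den = 0.0
    for A, v, pj in views:
        c = np.cross(pj, v)                    # p_j x v'_j
        num += c @ np.cross(A @ p, pj)         # (p_j x v'_j)^T (A_j p x p_j)
        den += c @ c                           # ||p_j x v'_j||^2
    return num / den

# Synthetic example: two views generated with a known k and arbitrary scales.
p = np.array([0.2, -0.1, 1.0])
A1 = np.array([[1.0, 0.1, 0.0], [0.0, 0.9, 0.2], [0.01, 0.0, 1.0]])
A2 = np.array([[0.9, -0.2, 0.1], [0.1, 1.1, 0.0], [0.0, 0.02, 1.0]])
v1 = np.array([0.3, 0.1, 1.0])
v2 = np.array([-0.4, 0.2, 1.0])
k_true = 0.7
views = [(A1, v1, 2.0 * (A1 @ p + k_true * v1)),
         (A2, v2, -0.5 * (A2 @ p + k_true * v2))]
k_est = relative_affine_ls(p, views)
```

Because each term is homogeneous of degree two in $p_j$, the unknown scale of each observed point cancels between numerator and denominator only in the noise-free case; with noisy data one would typically normalize the third coordinate of every $p_j$ to 1 first, as in the Notation section.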

3.1 Explanatory Text 

The key idea in Theorem 1 was to use both camera centers as part of the reference frame in order to show that the transformation between an arbitrary representation $\mathcal{R}_0$ of space as seen from the first camera and the representation $\mathcal{R}$ as seen from any other camera position can be described by an element of the affine group. In other words, we have chosen an arbitrary plane $\pi$ and made a choice of representation $\mathcal{R}_0$ in which $\pi$ is the plane at infinity (i.e., $\pi$ was mapped to infinity, not an unfamiliar trick, especially in computer graphics). The representation $\mathcal{R}_0$ is associated with $[x, y, 1, k]$, where $k$ vanishes for all points coplanar with $\pi$, which means that $\pi$ is the plane at infinity under the representation $\mathcal{R}_0$. What was left to show is that $\pi$ remains the plane at infinity under all subsequent camera transformations, and therefore $k$ is an affine invariant. Because $k$ is invariant relative to the representation $\mathcal{R}_0$ we named it “relative affine structure”; this should not be confused with the term “relative invariants” used in classical invariant theory (invariants multiplied by a power of the transformation determinant, as opposed to “absolute invariants”).

In practical terms, the difference between a full projective framework (as in [7, 13, 36]) and the relative affine framework can be described as follows. In a full projective framework, if we denote by $f$ the invariance function acting on a pair of views indexed by a fixed set of five corresponding points, then $f(\psi_i, \psi_j)$ is fixed for all $i, j$. In a relative affine framework, if we denote by $f_0$ the invariance function acting on a fixed view $\psi_0$ and an arbitrary view $\psi_i$ and indexed by a fixed set of four corresponding points, then $f_0(\psi_0, \psi_i)$ is fixed for all $i$.

The remaining Theorem 2 and corollaries put the relative affine framework within the familiar context of affine structure under parallel and perspective projections, Euclidean structure and projective structure. The homography $A$ due to the plane $\pi$ was described as a product of the rigid camera motion parameters, the parameters of $\pi$, and the internal camera parameters of both cameras. This result is a natural extension of the classical motion of planes found in [9, 43], and also in [22]. The relative affine structure $k$ was described as a product of the affine structure under parallel projection and a term that contains the location of the camera center of the reference view. Geometrically, $k$ is the product of two ratios, the first being the ratio of the perpendicular$^1$ distance of a point $P$ to the plane $\pi$ and the depth $z$ to the reference camera, and the second being of the same form but applied to a fixed point $P_0$ which is used to set a uniform scale to the system. Therefore, when the depth goes to infinity (projection approaches orthography), $k$ approaches the ratio of the perpendicular distance of $P$ from $\pi$ and the perpendicular distance of $P_0$ from $\pi$, which is precisely the affine structure under parallel projection [17]. Thus, relative affine structure is a generalization in the sense of including the center of projection of an arbitrary camera, and when the camera center goes to infinity we obtain an affine structure which becomes independent of the reference camera.

$^1$ Note that the distance can be measured along any fixed direction. We use the perpendicular distance because it is the most natural way of describing the distance between a point and a plane.

Another specialization of relative affine structure was shown in Corollary 2 by considering the case when $\pi$ is at infinity with respect to our Euclidean frame (i.e., really at infinity). In that case $k$ is simply inverse depth (up to a uniform scale factor), and the homography $A$ is the familiar rotational component of camera motion (orthogonal matrix $R$) in the case of calibrated cameras, or a product of $R$ with the internal calibration parameters. In other words, when $\pi$ is at infinity also with respect to our camera coordinate frame, then relative affine becomes affine (the plane at infinity is preserved under all representations [7]). Notice that the rays towards the plane at infinity are parallel across the two cameras (see Figure 3-b). Thus, there exists a rotation matrix that aligns the two bundles of rays, and following this line of argument, the same rotation matrix aligns the epipolar lines (scaled appropriately) because orthogonal matrices commute with cross products. We have therefore the algorithm of [18] for determining the rotational component of standard calibrated camera motion, given the epipoles. In practice, of course, we cannot recover the homography due to the plane at infinity unless we are given prior information on the nature of the scene structure [28], or the camera motion is purely translational ([24] and Corollary 3). Thus in the general case, we can realize either the relative affine framework or the projective framework.

In Corollary 3 we address a particular case in which we can recover the homography due to the plane at infinity, hence recover the affine structure of the scene. This is the case where the camera motion is purely translational and the internal camera parameters remain fixed (i.e., we use the same camera for all views). This case was addressed in [24] by using clever and elaborate geometric constructions. The basic idea in [24] is that under pure translation of a calibrated camera, certain lines and points on the plane at infinity are easily constructed in the image plane. A line and a point from the plane at infinity are then used as auxiliaries for recovering the affine coordinates of the scene (with respect to a frame of four object points).

The relative affine framework provides a single-line proof of the main result of [24] and, furthermore, provides an extremely simple algorithm for reconstruction of affine structure from a purely translating camera with fixed internal parameters, as follows. The epipole $v'$ is the focus of expansion and is determined from two corresponding points ($v' \cong (p_i \times p'_i) \times (p_j \times p'_j)$, for some $i, j$). Given corresponding points $p, p'$ in the two views, the coordinates $(x, y, k)$, where $k$ satisfies $p' \cong p + k v'$, are related to the Euclidean coordinates (with respect to a camera coordinate frame) by an element of the affine group. The scalar $k$ is determined up to scale, thus one of the points, say $p_0$, should determine the scale by scaling $v'$ to satisfy $p'_0 \cong p_0 + v'$ (note that $p_0$ can coincide with one of the points, $p_i$ or $p_j$, used for determining $v'$). In case we would like to determine the affine coordinates with respect to four object points $P_1, ..., P_4$, we simply assign the standard coordinates $(0, 0, 0)$, $(1, 0, 0)$, $(0, 1, 0)$ and $(0, 0, 1)$ to those points, and solve for the 3D affine transformation that maps $(x_i, y_i, k_i)$, $i = 1, ..., 4$, onto the standard coordinates (the mapping contains 12 parameters, and each of the four points determines three linear equations).
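The translating-camera procedure described above can be sketched in a few lines of Python/NumPy. This is a minimal illustration on a synthetic, noise-free scene: the internal matrix `K` (playing the role of $M = M'$), the translation, the scene points, and the helper names `focus_of_expansion`, `scale_epipole`, `affine_k` and `demo` are all assumptions made for the example. It checks that the recovered scalar equals $z_0 / z$ (Corollary 2 with $A = I$) and that the homogeneous coordinates $[x, y, 1, k]$ of Corollary 3 map affinely onto the camera-frame point.

```python
import numpy as np

def focus_of_expansion(pi, pi2, pj, pj2):
    """Epipole v' ~ (p_i x p_i') x (p_j x p_j'): intersection of two match lines."""
    return np.cross(np.cross(pi, pi2), np.cross(pj, pj2))

def scale_epipole(v2, p0, p0_2):
    """Rescale v' so that p0' ~ p0 + v'."""
    c = np.cross(p0_2, v2)
    return (np.cross(p0, p0_2) @ c / (c @ c)) * v2

def affine_k(v2, p, p2):
    """k in p' ~ p + k v' (Equation 1 with A = I)."""
    c = np.cross(p2, v2)
    return np.cross(p, p2) @ c / (c @ c)

def demo():
    K = np.array([[2.0, 0.1, 0.3], [0.0, 1.8, 0.2], [0.0, 0.0, 1.0]])
    T = np.array([0.4, -0.2, 0.5])                 # pure translation
    # The last row of K is (0, 0, 1), so the third coordinate of K X is the depth.
    proj = lambda X: K @ X / X[2]
    views = lambda X: (proj(X), proj(X + T))
    P0 = np.array([0.5, 0.5, 6.0])                 # sets the scale
    P1 = np.array([-1.0, 0.3, 8.0])
    P2 = np.array([0.7, -0.6, 5.0])
    P = np.array([-0.3, 0.8, 7.0])                 # point to reconstruct
    v2 = focus_of_expansion(*views(P1), *views(P2))
    v2 = scale_epipole(v2, *views(P0))
    p, p2 = views(P)
    k = affine_k(v2, p, p2)
    # Affine coordinates (x/k, y/k, 1/k), scaled by z0, should equal K @ P.
    return k, P0[2] / P[2], P0[2] * p / k, K @ P
```

Only two matches are needed for the epipole and one more for the scale, which mirrors the counting argument in the text.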

To conclude the implications of Corollary 3, we observe that given the epipole $v'$, we need only one more point match (for setting a mutual scale) in order to determine affine structure. This is obvious because the epipole is the translational component of camera motion, and since this is the only motion we assume to have, the structure of the scene should follow without additional information. This case is very similar to the classic paradigm of stereopsis: instead of assuming that epipolar lines are horizontal, we recover the epipole (two point matches are sufficient), and instead of assuming a calibrated camera we assume an uncalibrated camera whose internal parameters remain fixed, and in turn, instead of recovering depth we can recover at most the affine structure of the scene. Finally, the result that the homography due to the plane at infinity is the identity matrix can be derived on geometric grounds as well. Points and lines from the plane at infinity are fixed points of the homography; with an affine frame of four points we can observe four fixed points, and a homography with four fixed points is necessarily the identity matrix.

The connection between the relative affine structure and projective structure was shown in Corollary 4. Projective invariants are necessarily described with reference to five scene points [7], or equivalently, with reference to two planes and a point lying outside of them both [36, 34]. Corollary 4 shows that by taking the ratio of two relative affine structures, each relative to a different reference plane, the dependence on the camera center (the term $z_0 / z$) drops and we are left with the projective invariant described in [36], which is the ratio of the perpendicular distances of a point to two planes (up to a uniform scale factor).

Corollary 5 unifies previous results on the nature of what is known by now as the “fundamental matrix” [7, 8]. It is shown that for any plane $\pi$ and its corresponding homography $A$ we have $F = [v'] A$. First, we see that given a homography, the epipole $v'$ follows by having two corresponding points coming from scene points not coplanar with $\pi$, an observation that was originally made by [18]. Second, $F$ is fixed, regardless of the choice of $\pi$, which was shown by using the result of Theorem 2. As a particular case, the product $[v'] R$ remains fixed if we add to $R$ an element that vanishes as a product with $[v']$, an observation that was made previously by [13]. Thirdly, the “essential” matrix [19], $E = [v'] R$, is shown to be a specialization of $F$ in the case $\pi$ is at infinity with respect to the world coordinate frame and the cameras are internally calibrated as $M = M' = I$.

Finally, Corollary 6 provides a practical formula for obtaining a least-squares estimation of relative affine structure which also applies for the case where a stream of views is available, in the spirit of [46, 42, 23, 41, 1, 5]. In the next section we apply these results to obtain a simple algorithm for relative affine reconstruction from multiple ($m > 2$) views and multiple points.

3.2 Application I: Reconstruction from a Stream of Views

Taken together, the results above demonstrate the ability to compute relative affine structure using many points over many views in a least squares manner. At minimum we need two views and four corresponding points and the corresponding epipoles to recover $k$ for all other points of the scene whose projections onto the two views are given. Let $p_{ij}$, $i = 0, ..., n$ and $j = 0, ..., m$, denote the $i$'th image point on frame $j$. Let $A_j$ denote the homography from frame 0 to frame $j$, $v_j, v'_j$ the corresponding epipoles such that $A_j v_j \cong v'_j$, and let $k_i$ denote the relative affine structure of point $i$. We follow these steps:

1. Compute epipoles $v_j, v'_j$ using the relation $p_{ij}^T F_j p_{i0} = 0$, over all $i$. Eight corresponding points (frame 0 and frame $j$) are needed for a linear solution, and a least-squares solution is possible if more points are available. In practice the best results were obtained using the non-linear algorithm of [21]. The epipoles follow by $F_j v_j = 0$ and $F_j^T v'_j = 0$ [7]. The latter readily follows from Corollary 5 as $[v'_j] A_j v_j \cong [v'_j] v'_j = 0$ and $A_j^T [v'_j]^T v'_j = -A_j^T [v'_j] v'_j = 0$.

2. Compute $A_j$ from the equations $A_j p_{i0} \cong p_{ij}$, $i = 1, 2, 3$, and $A_j v_j \cong v'_j$. This leads to a linear set of eight equations for solving for $A_j$ up to scale. A least squares solution is available from the equation $p_{ij}^T [v'_j] A_j p_{i0} = 0$ for all additional points (Corollary 5). Scale $A_j$ to satisfy $p_{0j} \cong A_j p_{00} + v'_j$.

3. Relative affine structure $k_i$ is given by (3).
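Step 1 above, in its linear form, can be sketched as follows. This Python/NumPy example is only an illustration of the linear eight-point solution and the null-vector extraction of the epipoles on noise-free synthetic data; it is not the non-linear algorithm of [21] that the text reports as giving the best results. The function names `fundamental_linear`, `epipoles` and `demo`, and the scene values (with $M = M' = I$), are assumptions.

```python
import numpy as np

def fundamental_linear(p, p2):
    """Linear estimate of F (up to scale) from >= 8 matches with p2_i^T F p_i = 0."""
    # Each match gives one linear equation in the 9 entries of F (row-major).
    D = np.array([np.kron(b, a) for a, b in zip(p, p2)])
    _, _, vt = np.linalg.svd(D)
    return vt[-1].reshape(3, 3)

def epipoles(F):
    """Return (v, v') with F v = 0 and F^T v' = 0 as singular vectors of F."""
    u, _, vt = np.linalg.svd(F)
    return vt[-1], u[:, -1]

def demo():
    rng = np.random.default_rng(0)
    th = 0.1                                       # assumed rotation about the y axis
    R = np.array([[np.cos(th), 0, np.sin(th)], [0, 1, 0], [-np.sin(th), 0, np.cos(th)]])
    T = np.array([0.5, 0.1, 0.2])
    X = rng.uniform([-2, -2, 4], [2, 2, 10], size=(10, 3))   # ten scene points
    p = X / X[:, 2:]                               # frame 0
    X2 = X @ R.T + T
    p2 = X2 / X2[:, 2:]                            # frame j
    v, v2 = epipoles(fundamental_linear(p, p2))
    v_true, v2_true = -R.T @ T, T
    return v / v[2], v2 / v2[2], v_true / v_true[2], v2_true / v2_true[2]
```

With noisy correspondences the recovered $F$ is generally not exactly rank 2, which is one reason a non-linear refinement is usually preferred in practice.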

3.3 Application II: Recognition by Alignment 

The relative affine invariance relation, captured by Theorem 1, can be used for visual recognition by alignment ([44, 14], and references therein). In other words, the invariance of $k$ can be used to “re-project” the object onto any third view $p''$, as follows. Given two “model” views in full correspondence $p_i \leftrightarrow p'_i$, $i = 1, ..., n$, we recover the epipoles and homography $A$ from $A p_i \cong p'_i$, $i = 1, 2, 3$, and $A v \cong v'$. Then the corresponding points $p''_i$ in any third view satisfy $p''_i \cong B p_i + k_i v''$, for some matrix $B$ and epipole $v''$. One can solve for $B$ and $v''$ by observing six corresponding points between the first and third view. Once $B, v''$ are recovered, we can find the estimated location $\hat{p}''_i$ of $p''_i$ for the remaining points $p_i$, $i = 7, ..., n$, by first solving for $k_i$ from the equation $p'_i \cong A p_i + k_i v'$, and then substituting the result in the equation $\hat{p}''_i \cong B p_i + k_i v''$. Recognition is achieved if the distance between $p''_i$ and $\hat{p}''_i$, $i = 7, ..., n$, is sufficiently small. Other methods for achieving re-projection include the epipolar intersection method (cf. [26, 6, 11]), or using projective structure instead of the relative affine structure [34, 36]. In all the above methods the epipolar geometry plays a key role and must be recovered first. More direct methods that do not require the epipolar geometry can be found in [35, 37].
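One possible way to solve for $B$ and $v''$ from six points is to note that $B p_i + k_i v'' = [B, v''] \, q_i$ with $q_i = (x_i, y_i, 1, k_i)$, so the unknowns form a $3 \times 4$ matrix that can be estimated linearly. The Python/NumPy sketch below takes this route; the stacking into a single matrix `C`, the function names `fit_reprojection`, `reproject` and `demo`, and the synthetic values are assumptions for illustration and are not prescribed by the text. The $k_i$ are taken as already computed from the two model views.

```python
import numpy as np

def fit_reprojection(q, p3):
    """3x4 matrix C = [B | v''] (up to scale) with p3_i ~ C q_i, q_i = (x, y, 1, k)."""
    rows = []
    for qi, (u, v, w) in zip(q, p3):
        z = np.zeros(4)
        # The three rows of p3_i x (C q_i) = 0, with C flattened row by row.
        rows.append(np.concatenate([z, -w * qi, v * qi]))
        rows.append(np.concatenate([w * qi, z, -u * qi]))
        rows.append(np.concatenate([-v * qi, u * qi, z]))
    _, _, vt = np.linalg.svd(np.array(rows))
    return vt[-1].reshape(3, 4)

def reproject(C, qi):
    """Predicted third-view point, normalized to third coordinate 1."""
    x = C @ qi
    return x / x[2]

def demo():
    rng = np.random.default_rng(0)
    B = np.array([[1.0, 0.1, 0.2], [-0.1, 0.9, 0.1], [0.05, 0.02, 1.0]])
    v3 = np.array([0.3, -0.2, 0.1])                # assumed epipole v''
    C_true = np.column_stack([B, v3])
    xy = rng.uniform(-1, 1, size=(8, 2))
    k = rng.uniform(0.2, 2.0, size=8)              # relative affine structure values
    q = np.column_stack([xy, np.ones(8), k])
    p3 = np.array([reproject(C_true, qi) for qi in q])
    C = fit_reprojection(q[:6], p3[:6])            # six points fix B and v''
    # Largest re-projection error on the two held-out points.
    return max(np.abs(reproject(C, qi) - pi).max() for qi, pi in zip(q[6:], p3[6:]))
```

In a recognition setting, the value returned by `demo` corresponds to the distance that is compared against a threshold for the points not used in the fit.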

3.4 Application III: Image Coding 

The re-projection paradigm, described in the previous section, can serve as a principle for model-based image compression. In a sender/receiver mode, the sender computes the relative affine structure between two extreme views of a sequence, and sends the first view, the relative affine scalars, and the homographies and epipoles between the first frame and all the intermediate frames. The intermediate frames can be reconstructed by re-projection. Alternatively, the sender sends the two extreme views and the homographies and epipoles between the first and all other intermediate views. The receiver recovers the correspondence field between the two extreme views, and then synthesizes the remaining views from the received parameters of homographies and epipoles. In case the distance between the two extreme views is “moderate”, we found that optical flow techniques can be useful for the stage of obtaining the correspondence field between the views. Experiments can be found later in the text, and more detailed experiments concerning the use of optical flow in full registration of images for purposes of model-based image compression can be found in [4].

4 Experimental Results 

The following experiments were conducted to illustrate the applications that arise from the relative affine framework (reconstruction, recognition by alignment, and image coding) and to test the algorithms on real data. The performance under real imaging situations is interesting, in particular, because of the presence of deviations from the pin-hole camera model (radial distortions, decentering, and other effects), and due to errors in obtaining image correspondences.

Fig. 5 shows four views, out of a sequence of ten views, of the object we selected for experiments. The object is a sneaker with added texture to facilitate the correspondence process. This object was chosen because of its complexity, i.e., it has the shape of a natural object and cannot easily be described parametrically (as a collection of planes or algebraic surfaces). A set of thirty-four points was manually selected on one of the frames, referred to as the first frame, and their correspondences were automatically obtained along all other frames used in this experiment (corresponding points are marked by overlapping squares in Fig. 5). The correspondence process is based on an implementation of a coarse-to-fine optical-flow algorithm based on [20] and described in [3].




Figure 5: Four views, out of a sequence of ten views, of a sneaker. The frames shown here are the first, second, fifth and
tenth of the sequence (top-bottom, left-to-right). The overlaid squares mark the corresponding points that were tracked and
subsequently used for our experiments.


Figure 6: Results of 3D reconstruction of the collection of sample points. (a) Frontal view (aligned with the first frame of
the sneaker). The two bottom displays show a side view of the sample. (b) Result of recovering structure between the first
and tenth frame (large base-line); (c) Result of recovery between the first and second frames (small base-line).



Figure 7: Results of re-projection onto the tenth frame. Epipoles were recovered using the ground plane homography (see
text). The re-projected points are marked by crosses, and should be in the center of their corresponding square for accurate
re-projection. (a) Structure was recovered between the first and fifth frames, then re-projected onto the tenth frame (large
base-line). Average error is 1.1 pixels with std of 0.98. (b) Structure was recovered between the first and second frames (small
base-line situation) and then re-projected onto the tenth frame. Average error is 7.81 pixels with std of 6.5.

Epipoles were recovered by either one of the following
two methods. First, by using the four ground points to 
recover the homography A, and then by Corollary 5 to 
compute the epipoles using all the remaining points in 
a least squares manner. Second, using the non-linear 
algorithm proposed by [21]. The two methods gave rise 
to very similar results for reconstruction, and slightly 
different results for re-projection (see later). 
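
The first method can be sketched as follows. This is a minimal
illustration and not the implementation used in the experiments. It
assumes that the constraint behind Corollary 5 is that every matched
pair (p, p') defines a line p' x (A p) in the second view that passes
through the epipole v'. The function name and array layout are
hypothetical, and NumPy is assumed to be available.

```python
import numpy as np

def epipole_from_homography(A, p, p_prime):
    """Least-squares epipole v' from a plane homography A and point matches.

    p, p_prime: (N, 3) arrays of homogeneous image points in views 1 and 2.
    Each match gives the line l_i = p'_i x (A p_i), which should contain v',
    so v' is taken as the unit vector minimising sum_i (l_i . v')^2.
    """
    mapped = p @ A.T                       # A p_i for every point
    lines = np.cross(p_prime, mapped)      # one line per match in view 2
    norms = np.linalg.norm(lines, axis=1)
    keep = norms > 1e-12                   # points on the plane give no line
    lines = lines[keep] / norms[keep, None]
    _, _, vt = np.linalg.svd(lines)
    return vt[-1]                          # direction of smallest singular value
```

The result is defined only up to scale and sign, as expected for a
homogeneous point. Points that lie on the reference plane satisfy
p' ~ A p and are dropped, which matches the use of the remaining
points in the text.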

In the reconstruction paradigm, we recovered relative 
affine structure from two views and multiple views. In 
the two-view case we used either a small base-line (the 
first two views of the sequence) or a large base-line (the 
first and last views of the sequence). In the multiple 
view case, we used all ten views of the sequence
(Corollary 6). The transformation to Euclidean coordinates
was done for purposes of display by assuming that the 
ground plane is parallel to the image plane (it actually 
is not) and that the camera is calibrated (there was no 
calibration attempt made). 

The 3D coordinates are shown in Fig. 6. Display (a)
shows a frontal view (in order to visually align the
display with the image of the sneaker). The other displays show
a side view of the reconstructed sneaker under the
following experimental situations. Display (b) is due to
reconstruction under a large base-line situation (the two
methods for obtaining the epipoles produced very similar
results; the multiple-view case produced very similar
results as well). The side view illustrates the robustness
of the reconstruction process, as it was obtained by rotating
the object around a different axis than the one used
for capturing the images. Display (c) is due to
reconstruction under a small base-line situation (both methods
for obtaining the epipoles produced very similar results).
The quality of reconstruction in the latter case is not as
good as in the former, as should be expected.
Nevertheless, the system does not totally break down under
relatively small base-line situations and produces a
reasonable result under these circumstances.

In the re-projection application (see Section 3.3),
relative affine structure was recovered using the first and
in-between views, and re-projected onto the last view of
the sequence. Note that this is an extrapolation example,
so performance is expected to be poorer than in
interpolation examples, i.e., when the re-projected view
is in-between the model views. The interpolation case
will be discussed in the next section, where relevance to
image coding applications is argued for.
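
The two steps of this experiment (recover structure from two views,
then predict a third) can be sketched as below. The sketch assumes the
relative affine relation has the form p' ~ A p + k v', with A the
homography of the reference plane, v' the epipole and k the relative
affine scalar, and that the homography and epipole of every view are
scaled consistently. The helper names are hypothetical.

```python
def cross(a, b):
    """Cross product of two 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(M, x):
    """Product of a 3x3 matrix (sequence of rows) with a 3-vector."""
    return tuple(dot(row, x) for row in M)

def relative_affine_k(A, v, p, p_prime):
    """Least-squares k such that p' is parallel to A p + k v'.

    Taking the cross product with p' gives p' x (A p) + k (p' x v') = 0.
    The point p' must not coincide with the epipole.
    """
    a = cross(p_prime, matvec(A, p))
    b = cross(p_prime, v)
    return -dot(a, b) / dot(b, b)

def reproject(A2, v2, p, k):
    """Predict the image point in another view and return it as (x, y)."""
    q = matvec(A2, p)
    x, y, z = (q[i] + k * v2[i] for i in range(3))
    return (x / z, y / z)
```

Because k is computed from cross products with p', the overall scale
of the homogeneous vector p' does not affect the result.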

In general, the performance was better when the 
ground plane was used for recovering the epipoles. When 
the intermediate view was the fifth in the sequence 
(Fig. 5, display (c)), the average error in re-projection 
was 1.1 pixels (with standard deviation of 0.98 pixels). 
When the intermediate view was the second frame in the 
sequence (Fig. 5, display (b)), the results were poorer 
(due to small base-line and large extrapolation) with an
average error of 7.81 pixels (standard deviation of 6.5).
These two cases are displayed in Fig. 7. The re-projected
points are represented by crosses overlaid on the last
frame (the re-projected view).

When the second method for computing the epipoles
was used (more general, but generally less accurate), the
results were as follows. With the fifth frame, the average
error was 1.62 pixels (standard deviation of 1.2); and
with the second frame (small base-line situation) the
average error was 13.87 pixels (standard deviation of 9.47).
These two cases are displayed in Fig. 8. Note that because
all points were used for recovering the epipoles, the
re-projection performance only indicates the level of
accuracy one can obtain when all the information is being
used. In practice we would like to use far fewer points
from the re-projected view, and therefore re-projection
methods that avoid the epipoles altogether would be
preferred; an example of such a method can be found
in [35, 37].

For the image coding paradigm (see Section 3.4),
relative affine structure of the 34 sample points was
computed between the first and last frames of the ten-frame
sequence (displays (a) and (d) in Fig. 5). Display (a)
in Fig. 9 shows a graph of the average re-projection
error for all the intermediate frames (from the second to the
ninth frame). Display (b) shows the relative error normalized
by the distance between corresponding points across the 
sequence. We see that the relative error generally goes 
down as the re-projected frame is farther from the first 
frame (increase of base-line). In all frames, the average 
error is less than 1 pixel, indicating a relatively robust 
performance in practice. 
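
The two quantities plotted in Fig. 9 could be computed per frame
roughly as follows. This sketch makes several assumptions that the
text does not settle: the displacement of a point is taken as its
distance from its first-frame position, the relative error is the mean
of the per-point ratios, and the standard deviation is the sample
estimate. The function name is hypothetical.

```python
import math
from statistics import mean, stdev

def reprojection_stats(reprojected, tracked, first_frame):
    """Return (mean error, std of error, mean relative error) for one frame.

    reprojected: predicted (x, y) positions in the frame
    tracked:     measured (x, y) positions of the same points in that frame
    first_frame: (x, y) positions of the same points in the first frame
    """
    errors = [math.dist(r, t) for r, t in zip(reprojected, tracked)]
    displacements = [math.dist(f, t) for f, t in zip(first_frame, tracked)]
    # Skip points that did not move, to avoid division by zero.
    relative = [e / d for e, d in zip(errors, displacements) if d > 0]
    return mean(errors), stdev(errors), mean(relative)
```

Calling this once per intermediate frame gives one value for each of
the two curves.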

5 Summary 

The framework of “relative affine” was introduced and 
shown to be general and sharper than the projective
results for purposes of 3D reconstruction from multiple
views and for the task of recognition by alignment. One
of the key ideas in this work is to define and recover
an invariant that stands in the middle ground between
affine and projective. The middle ground is achieved
by having the camera center of one arbitrary view as
part of the projective reference frame (of five points),
thus obtaining the first result described in Theorem 1
(originally in [33]). The result simply states that
under general uncalibrated camera motion, the sharpest
result we can obtain is that all the degrees of freedom
are captured by four points (thus the scene may
undergo at most 3D affine transformations) and a single
unknown projective transformation (from the arbitrary 
viewer-centered representation $\mathcal{R}_0$ to the camera
coordinate frame). The invariants that are obtained in this
way are viewer-centered since the camera center is part of 
the reference frame and are called “relative affine
structure”. This statement, that all the available degrees of
freedom are captured by four points and one projective
transformation, was also recently presented in [40] using
different notations and tools than those used here and in
[33, 38].

Figure 8: Re-projection onto the tenth frame. Epipoles are computed via fundamental matrix (see text) using the
implementation of [21]. (a) Large base-line situation (structure computed between first and fifth frames): average error 1.62 with std of 1.2.
(b) Small base-line situation (structure computed between first and second frames): average error 13.87 with std of 9.47.

Figure 9: Error in re-projection onto the intermediate frames (2-9). Structure was computed between frames one and ten.
(a) average error in pixels, (b) relative error normalized by the displacement between corresponding points.

This “middle ground” approach has several advantages.
First, the results are sharper than those of a full projective
reconstruction approach ([7, 13]) where five scene
points are needed. The increased sharpness translates
to a remarkably simple framework captured by a single
equation (Equation 1). Second, the manner in which
the results were derived provides the means for unifying
a wide range of other previous results, thus obtaining a
canonical framework. Following Theorem 2, the corollaries
show how this “middle ground” reduces back to full
affine structure and extends into full projective
structure (Corollaries 1 and 4). The corollaries also show
how the “plane at infinity” is easily manipulated in this
framework, thereby making further connections among
projective, affine, and Euclidean results in general and less
general situations (Corollaries 2 and 3). The corollaries
also unify the various results related to the epipolar
geometry of two views: the Essential matrix of [19], the
Fundamental matrix of [7] and other related results of
[13] (Corollary 5). All the above connections and
results are often obtained as a single-line proof and follow
naturally from the relative affine framework.

Finally, the relative affine result has proven useful 
for derivation of other results and applications, some of 
which can be found in [39, 37, 35]. The derivation of
those results critically relies on the simplicity of the
relative affine framework, and in some cases [37, 35] on the
sharpness of the framework compared to the projective
framework.